Update: Findings from this notebook were summarized in the Medium blogpost

Intro

In the curated COVID-19 datasets there are mainly cumulative descriptions of larger populations, e.g. country-level counts of infected people.

Being valuable, it tells us little about an age/gender group our parents belong to, or the group of our partners, or our siblings' age/gender.

In [1]:
from IPython.core.display import Image, display, HTML
display(Image("../input/helpingillustrations/ntk_pic.png"))
display(HTML("<style>.container { width:100% !important; }</style>"))

drawing credit: Dan Perjovschi

In this notebook I'll focus on person-level gender and age information.

In addition to the Roche UNCOVER Challenge I'm bringing detailed data about contracted cases in Czech republic. Please, see the dataset page for more details.

Goals

  • Firstly, we'll do exploratory data analysis. In the EDA the main goal would be to drill down into the persol-level details as much as possible. The detailed drill down would be illustrated on Czechia due to better data at hand.

  • Secondly, we'll target the following important question:
    Does COVID-19 attack all age-gender groups evenly?

  • Thirdly, we'll attempt to quantitatively estimate by how much are selected age-gender groups more at risk of contracting COVID-19.

A note before we dive into:

The data are likely to be biased towards people who not only contracted COVID-19, but at the same time had moderate, severe, or critical symptoms. Let's still say "contracting COVID-19 instead of more precise but verbose contracting COVID-19 and having moderate, severe, or critical symptoms

Content

  1. Libraries
  2. EDA: The Czech Republic COVID-19 cases
  3. EDA: Canada COVID-19 cases
  4. Does COVID-19 attack all age-gender groups evenly?
  5. Groups at risk of contracting COVID-19 based on data
  6. Conservative Detection of age/gender groups with higher risk of contracting COVID-19
  7. Conclusions

Libraries

In [2]:
# !pip install chart_studio
In [3]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
try:
    from tqdm.notebook import tqdm
except:
    from tqdm import tqdm_notebook as tqdm
import requests
import sys
from itertools import chain
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio
# import chart_studio
# import chart_studio.plotly as py
from scipy.stats import chisquare
from statsmodels.stats.proportion import multinomial_proportions_confint, proportion_confint
from statsmodels.stats.multitest import multipletests
import warnings
%matplotlib inline
In [4]:
export_plotly_figs_online = False
if export_plotly_figs_online:
    username = '' # your username
    api_key = '' # your api key - go to profile > settings > regenerate key
    chart_studio.tools.set_credentials_file(username=username, api_key=api_key)

EDA: The Czech Republic COVID-19 cases

Reading and formating COVID-19 data

In [5]:
df_czechia = pd.read_csv('../input/covid19-the-czech-republic-personlevel-data/CZ_COVID19_persons_2020_04_13.csv')
df_czech_regions = pd.read_csv('../input/covid19-the-czech-republic-personlevel-data/CZ_region_codes.csv', index_col='region_code_NUTS3')
df_czechia['region_name'] = df_czechia['CZ_region_code_NUTS3'].map(df_czech_regions['region_name'])
df_czechia['gender'] = df_czechia['gender'].map(lambda x: 'Female' if x == 'Z' else 'Male' if x == 'M' else np.nan)
df_czechia['report_date'] = pd.to_datetime(df_czechia['report_date'])
df_czechia['report_date'] = df_czechia['report_date'].map(lambda x: x.date())
df_czechia.head()
Out[5]:
report_date age gender CZ_region_code_NUTS3 imported_case country_of_exposure_csu_code region_name
0 2020-02-29 67 Male CZ010 1.0 IT Prague
1 2020-03-01 21 Female CZ010 1.0 IT Prague
2 2020-02-29 20 Female CZ010 1.0 IT Prague
3 2020-03-03 41 Female CZ010 1.0 IT Prague
4 2020-03-03 44 Male CZ010 1.0 IT Prague

Checking data quality

Missing values

In [6]:
df_czechia.isnull().sum()
Out[6]:
report_date                        0
age                                0
gender                             0
CZ_region_code_NUTS3               0
imported_case                   5193
country_of_exposure_csu_code    5193
region_name                        0
dtype: int64
In [7]:
sum(df_czechia['imported_case'].isnull() | df_czechia['country_of_exposure_csu_code'].isnull())/df_czechia['imported_case'].isnull().sum()
Out[7]:
1.0

The attributes imported_case and country_of_exposure_csu_code have missing values sumaltenously only. Missing value means local exposure to COVID-19, inside Czechia.

Fixing Exposure Codes

In [8]:
df_czechia[df_czechia['country_of_exposure_csu_code'] == 'CZ']
Out[8]:
report_date age gender CZ_region_code_NUTS3 imported_case country_of_exposure_csu_code region_name
3003 2020-03-30 40 Female CZ042 1.0 CZ Ústí nad Labem Region
3004 2020-03-30 59 Female CZ042 1.0 CZ Ústí nad Labem Region
3044 2020-03-31 57 Female CZ042 1.0 CZ Ústí nad Labem Region
3068 2020-03-31 3 Male CZ042 1.0 CZ Ústí nad Labem Region

These four records state exposure to COVID-19 outside The Czech Republic, while listing The Czech Republic as the country of exposure. Let's remove these noisy records, so that all cases of exposure inside CZ have consistent format.

In [9]:
df_czechia.loc[df_czechia['country_of_exposure_csu_code'] == 'CZ', 'imported_case'] = np.nan
df_czechia.loc[df_czechia['country_of_exposure_csu_code'] == 'CZ', 'country_of_exposure_csu_code'] = np.nan

Czech regions drill down: from overall stats down to who, when, and where

Overall age distributions per gender

In [10]:
plt.figure(figsize=(14, 7))
bins = df_czechia['age'].quantile(np.arange(0, 1.001, 0.1))
sns.distplot(df_czechia.loc[df_czechia['gender'] == 'Female', 'age'], bins=bins)
sns.distplot(df_czechia.loc[df_czechia['gender'] == 'Male', 'age'], bins=bins)
plt.legend(['Female', 'Male'])
plt.title('Czechia: age distributions of COVID-19 cases per gender', fontsize=21)
Out[10]:
Text(0.5, 1.0, 'Czechia: age distributions of COVID-19 cases per gender')

We can observe two interesting differences between the distributions of Females vs. Males.  Firstly, it seems that Females in their 30's contract COVID-19 less frequently compared to Males in the same age group. The same holds for Females vs. Males in their 70's. Moreover, Women in there 30's have less corona cases than women in their 20's. We'll later check general population proportions and see if these observations can be due to demographics subtleties, sit tight!

back to Content

Overall COVID-19 cases per Czech regions

In [11]:
plt.rcParams.update({'font.size': 15})

fig, ax = plt.subplots(figsize=(16, 7))
region_values = df_czechia['region_name'].value_counts(ascending=True)
region_values.plot(kind='barh', color=['dodgerblue'])

for i, v in enumerate(region_values):
    ax.text(v + 1, i - 0.2, str(v), color='dodgerblue', fontweight='bold')
plt.title('Czechia: COVID-19 cases per regions', fontsize=21)
Out[11]:
Text(0.5, 1.0, 'Czechia: COVID-19 cases per regions')

Oh, Prague.. It should be noted that based on official stats, overall population of Central Bohemian region is slightly higher than population of Prague. Yet, the population density in Prague is more than 30x higher.

back to Content

Who and Where? Age and gender of COVID-19 patients for each region

What are age distributions per each region?

To answer this question, let's first create age groups.

In [12]:
df_czechia['age_group'] = pd.cut(df_czechia['age'], 
                                 bins = [0, 20] + list(range(30, 91, 10)) + [110],
                                 right=False
                                )

Now, we'll create heatmap summarizing age distributions per region. Each row would represent one region, and columns would represent age groups. This way the heatmap basically provides info of the 3D histogram (I think of it as looking on the 3D histogram from above)

In [13]:
def plot_summarization_heatmap(gender, title, 
                               bool_index=None, 
                               second_dim_col='report_date', 
                               color='red', 
                               fig_width=800, fig_height=400, 
                               log_scale=False,
                               upload_fig_filename=None
                              ):
    """
    Plots distribution of COVID-19 cases per Age group and Date as a heatmap.
    Date is on the X axis, each row corresponds to a particular age group.
    
    @param gender: string specifying gender of the visualization
    @param title: plot title
    @param bool_index: (optional) pandas Series or numpy array of booleans for boolean indexing of the Czechia dataframe
    @param second_dim_col: (optional) column name for the Y axis of the heatmap, expected values are either 'report_date' (default), or 'Country Name', or 'region_name'    
    @param color: (optional) color for the colorscale, expected values are either 'red' (default), or 'amp'
    @param fig_width: (optional) width of the plotly figure
    @param fig_height: (optional) height of the plotly figure
    @param log_scale: (optional) boolean flag to display color on log-scale
    @param upload_fig_filename: (optional) when provided, the figure will be uploaded to your account with filename upload_fig_filename. It is assumed that plotly credentials has been set
    """
    # checking inputs
    assert color in ['red', 'amp'], "expected values for color are either 'red' (default), or 'amp'"
    assert second_dim_col in ['report_date', 'Country Name', 'region_name'], "expected values for second_dim_col are either 'report_date' (default), or 'Country Name', or 'region_name'"
    if bool_index is None:
        bool_index = np.ones(len(df_czechia), dtype=bool)
    vmax = (df_czechia[bool_index]
                  .groupby(['age_group', second_dim_col, 'gender'])['age']
                  .count()
                  .max())
    bool_index &= df_czechia['gender']==gender

    plt.figure(figsize=(14, 4))
    values = (df_czechia[bool_index]
              .groupby(['age_group', second_dim_col])['age']
              .count()
              .unstack()
              .fillna(0)
             )
    values.index = list(map(str, values.index))
    if second_dim_col != 'report_date':
        values = values.T
        x_label = 'Age group [years]'
        y_label = f"{'Region' if second_dim_col == 'region_name' else 'Country'} Name"
    else:
        y_label = 'Age group [years]'
        x_label = 'Date'
    cmap = px.colors.sequential.Reds if color == 'red' else px.colors.sequential.amp
    vmax_vis = round(vmax*4)
    if log_scale:
        colorscale= [[min(1, count_val/vmax_vis) if count_val > 0 else 0, color] 
                     for count_val, color in zip([0] + list(np.logspace(0, np.log10(vmax_vis), len(cmap) - 1)), cmap)]
    else:
        colorscale= [[min(1.0, count_val/vmax_vis) if count_val > 0 else 0, color]
                     for count_val, color in zip(np.linspace(0, vmax_vis, len(cmap)), cmap)]
    fig = go.Figure(data=go.Heatmap(z=values.values,
                                    zmin=0, 
                                    zmax=vmax,
                                    hovertemplate=str(x_label) + ': %{x}<br>' + str(y_label) + ': %{y}<br>Count: %{z}<extra></extra>',
                                    x=values.columns,
                                    y=values.index,
                                    colorscale=colorscale,
                                    colorbar = dict(title='Count'))
                   )
    fig.update_xaxes(tickangle=45)
    fig.update_layout(title=title, height=fig_height, width=fig_width, 
                      xaxis_nticks=len(values.columns), yaxis_nticks=len(values.index), 
                      xaxis=dict(title=x_label), yaxis=dict(title=y_label),
                      plot_bgcolor='rgba(0, 0, 0, 0)', paper_bgcolor='rgba(0, 0, 0, 0)',
                      margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2}, )
    if not upload_fig_filename is None:
        py.plot(fig, filename = upload_fig_filename)
    fig.show()
In [14]:
plot_summarization_heatmap('Female', 'Czechia, Females: cases in age groups per regions', 
                           second_dim_col='region_name',
                           fig_width=600, fig_height=400,
                           color='amp',
                           upload_fig_filename='cz_females_age_region' if export_plotly_figs_online else None
                          )
<Figure size 1008x288 with 0 Axes>

Interestingly, it seems that Olomouc distribution is shifted towards younger women, while Ústí nad Labem seem to have slight shift towards older ladies.

When comparing official population statistics based on this dataset, average female age doesn't differ significantly for Prague, Olomouc region and Ústí nad Labem region, being 43.3, 43.2,a 44.1 respectively. Olomouc mean is actually higher. However, Olomouc is known to be a "student" city with up to 1/5 of all people there being students from all over the country/world (reference in Czech). So, the shift in Olomouc might be actually due to people studying there, but with official residence in a different region of the Republic.

There are many students in Prague as well, which might be behind higher count of infected women between 20-29 compared to women between 30-39.

Note: as the observations might suggest migration due to studies/work, for the planned statistical tests we'll use age-gender groups as categories, without further detalization into regions.

Analogously, let's check age distributions per region for men:

In [15]:
plot_summarization_heatmap('Male', 'Czechia, Males: cases in age groups per regions', 
                           second_dim_col='region_name',
                           fig_width=600, fig_height=400,
                           color='amp',                           
                           upload_fig_filename='cz_males_age_region' if export_plotly_figs_online else None
                          )
<Figure size 1008x288 with 0 Axes>

We've previously seen that Women in their 30's have less COVID-19 cases compared to women in 20's. Now we can see that the same also holds for Men in several regions!

back to Content

When and Where? COVID-19 patients per each region in time

Computing total counts of COVID-19 positive cases per date in each region. To ensure that each region has a record for each date, we transform long-format data to the wide format and reindex by the date range, then filling missing values forward (so that days without new reported cases also have record), and then stack the dataframe back into the long format.

In [16]:
df_czechia_in_time = (df_czechia
                      .groupby(['region_name', 'report_date'])['age'].count()
                      .groupby(level=0)
                      .cumsum()
                     )

dates_available = df_czechia_in_time.index.get_level_values(1)
start_date = dates_available.min()
last_date = dates_available.max()
all_dates = pd.date_range(start_date, last_date)

df_czechia_in_time_wide = df_czechia_in_time.unstack(level=0)
df_czechia_in_time_wide = df_czechia_in_time_wide.reindex(all_dates)
# if there were no cases for the first date, fill in zero
df_czechia_in_time_wide.loc[start_date, df_czechia_in_time_wide.loc[start_date].isnull()] = 0 
df_czechia_in_time_wide.fillna(method='ffill', inplace=True)

df_czechia_in_time = df_czechia_in_time_wide.stack().reset_index().rename(columns={'level_0': 'report_date', 0: 'Total number of COVID-19 cases'})

Let's validate our data and the aggregation procedure by comparing the overall aggregation we have with the available cumulative data.

In [17]:
open_data = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')
cz_open_data = (open_data.loc[open_data['Country/Region'] == 'Czechia']
                .drop(['Lat', 'Long', 'Country/Region'], axis=1)
                .stack()
                .droplevel(level=0))
cz_open_data.index = pd.to_datetime(cz_open_data.index)
df_czechia_in_time_cumul = df_czechia_in_time.groupby('report_date')['Total number of COVID-19 cases'].sum()

plt.figure(figsize=(10,5))
ax = plt.axes()
cz_open_data.plot(ax=ax, linewidth=3, marker='o', label='Novel-Corona dataset summary data')
df_czechia_in_time_cumul.plot(ax=ax, linewidth=3, marker='o', label='Aggregation of person-level data')
plt.legend()
plt.title('Total cumulative count of COVID-19 cases', fontsize=21)
Out[17]:
Text(0.5, 1.0, 'Total cumulative count of COVID-19 cases')

We can see that the overall trend is the same and the differences are minor, which provides us the validation we sought. In the public media it was reported that the individual data has been post-processed and cleaned, with multiple tests of the same person being omitted, it can explain the observed minor difference and would suggest that the aggregations based on person-level data might be more precise.

The following plotly visualizations are inspired with this amazing kernel.

Geographical data for plotting would be taken from https://github.com/deldersveld/topojson/blob/master/countries/czech-republic/czech-republic-regions.json.

In [18]:
# getting geo data for plotting
r = requests.get(url='https://raw.githubusercontent.com/deldersveld/topojson/master/countries/czech-republic/czech-republic-regions.json')
topology = r.json()
In [19]:
# Convert topology json into geojson
#The code is from https://gist.github.com/perrygeo/1e767e42e8bc54ad7262
def rel2abs(arc, scale=None, translate=None):
    """Yields absolute coordinate tuples from a delta-encoded arc.
    If either the scale or translate parameter evaluate to False, yield the
    arc coordinates with no transformation."""
    if scale and translate:
        a, b = 0, 0
        for ax, bx in arc:
            a += ax
            b += bx
            yield scale[0]*a + translate[0], scale[1]*b + translate[1]
    else:
        for x, y in arc:
            yield x, y

def coordinates(arcs, topology_arcs, scale=None, translate=None):
    """Return GeoJSON coordinates for the sequence(s) of arcs.
    
    The arcs parameter may be a sequence of ints, each the index of a
    coordinate sequence within topology_arcs
    within the entire topology -- describing a line string, a sequence of 
    such sequences -- describing a polygon, or a sequence of polygon arcs.
    
    The topology_arcs parameter is a list of the shared, absolute or
    delta-encoded arcs in the dataset.
    The scale and translate parameters are used to convert from delta-encoded
    to absolute coordinates. They are 2-tuples and are usually provided by
    a TopoJSON dataset. 
    """
    if isinstance(arcs[0], int):
        coords = [
            list(
                rel2abs(
                    topology_arcs[arc if arc >= 0 else ~arc],
                    scale, 
                    translate )
                 )[::arc >= 0 or -1][i > 0:] \
            for i, arc in enumerate(arcs) ]
        return list(chain.from_iterable(coords))
    elif isinstance(arcs[0], (list, tuple)):
        return list(
            coordinates(arc, topology_arcs, scale, translate) for arc in arcs)
    else:
        raise ValueError("Invalid input %s", arcs)

def geometry(obj, topology_arcs, scale=None, translate=None):
    """Converts a topology object to a geometry object.
    
    The topology object is a dict with 'type' and 'arcs' items, such as
    {'type': "LineString", 'arcs': [0, 1, 2]}.
    See the coordinates() function for a description of the other three
    parameters.
    """
    return {
        "type": obj['type'], 
        "coordinates": coordinates(
            obj['arcs'], topology_arcs, scale, translate )}

from shapely.geometry import asShape

topojson_path = sys.argv[1]
geojson_path = sys.argv[2]


# file can be renamed, the first 'object' is more reliable
layername = list(topology['objects'].keys())[0]  

features = topology['objects'][layername]['geometries']
scale = topology['transform']['scale']
trans = topology['transform']['translate']

fc = {'type': "FeatureCollection", 'features': []}

for id, tf in enumerate(features):
    f = {'id': id, 'type': "Feature"}
    f['properties'] = tf['properties'].copy()

    geommap = geometry(tf, topology['arcs'], scale, trans)
    geom = asShape(geommap).buffer(0)
    assert geom.is_valid
    f['geometry'] = geom.__geo_interface__

    fc['features'].append(f) 

Manually mapping the regions to map ids.

In [20]:
region_name_2_map_id = {'Ústí nad Labem Region': 0,
                        'South Bohemian Region': 1,
                        'South Moravian Region': 2,
                        'Karlovy Vary Region': 3,
                        'Hradec Králové Region': 4,
                        'Vysočina Region': 5,
                        'Liberec Region': 6,
                        'Moravian-Silesian Region': 7,
                        'Olomouc Region': 8,
                        'Pardubice Region': 9,
                        'Plzeň Region': 10,                        
                        'Prague': 11,
                        'Central Bohemian Region': 12,
                        'Zlín Region': 13}
In [21]:
df_czechia_in_time['map_id'] = df_czechia_in_time['region_name'].map(region_name_2_map_id)
df_czechia_in_time['Date'] = df_czechia_in_time['report_date'].map(lambda x: x.strftime('%Y-%m-%d'))
In [22]:
fig = px.choropleth(df_czechia_in_time,
                    geojson=fc,
                    locations='map_id',
                    animation_frame='Date',
                    color_continuous_scale=px.colors.sequential.amp,
                    hover_name='region_name',
                    range_color=(0, df_czechia_in_time['Total number of COVID-19 cases'].max()),
                    color='Total number of COVID-19 cases', 
                    title='Czechia: COVID-19 patients per regions'
                   )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
In [23]:
pio.write_html(fig, file='cz_covid_regions.html')

On the map above we're able to check precise numbers. Let's also animate log of the total COVID-19 cases to better see the spread in early days.

In [24]:
df_czechia_in_time['Log_2 of total number of COVID-19 cases'] = df_czechia_in_time['Total number of COVID-19 cases'].map(lambda x: np.log2(x) if x != 0 else 0)
fig = px.choropleth(df_czechia_in_time,
                    geojson=fc,
                    locations='map_id',
                    animation_frame='Date',
                    color_continuous_scale=px.colors.sequential.amp,
                    hover_data=['region_name', 'Total number of COVID-19 cases'],
                    range_color=(0, df_czechia_in_time['Log_2 of total number of COVID-19 cases'].max()),
                    color='Log_2 of total number of COVID-19 cases', 
                    title='Czechia: COVID-19 patients per regions (log scale)'
                   )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()

back to Content

Who and When? COVID-19 patients per age-gender groups in time

Which age-gender groups contract COVID-19 more? And when?

Let's check how dynamics of new cases per age groups for women. On x axis we're going to have date, and each row corresponds to an age group.

In [25]:
plot_summarization_heatmap('Female', 'Czechia, Females: daily numbers of new COVID-19 cases',                           
                           upload_fig_filename='cz_females_age_date' if export_plotly_figs_online else None)
<Figure size 1008x288 with 0 Axes>

We can see that women between 40-49 have the most COVID-19 cases.

In [26]:
plot_summarization_heatmap('Male', 'Czechia, Males: daily numbers of new COVID-19 cases',                           
                           upload_fig_filename='cz_males_age_date' if export_plotly_figs_online else None)
<Figure size 1008x288 with 0 Axes>

The detailed visualization enables us to check, for instance, groups of older people. To compare genders, the color scale is the same in the both figures. We can see that in the beginning of epidemic men in their 60's had one of the highest daily counts of new COVID-19 cases among all age groups. We know that the Czech government has implemented additional measures of protecting older people rather early on. And in the data we can see that for people in 60's the daily increase of patients does not grow as it does for younger age groups, even though the beginning looked similar. For a week or so the older people counts actually went down. However, later the counts started to pop, especially for men it got closer to other age groups on the same date. Another troubling observation is the increase in the new daily counts for Females in their 40's. Thankfully, lately the daily increases for this group look better, but still the group remains a leader in new cases daily.

back to Content

What gender, When, and Where? Proportion of new female patients per each region in time

Let's compute and visualize weekly proportions of female COVID-19 patients in each Czech region.

In [27]:
# prepare_female_proportion_in_time_data
df_czechia['report_week'] = df_czechia['report_date'].map(lambda x: x.isocalendar()[1])
week_2_start_date = df_czechia.groupby('report_week')['report_date'].min()
df_czechia['Week Start Date'] = df_czechia['report_week'].map(week_2_start_date).map(str)
df_czechia.drop('report_week', axis=1, inplace=True)


df_czechia_gender_in_time = (df_czechia
                             .groupby(['region_name', 'Week Start Date'], as_index=False)['gender']
                             .agg(list)
                          )
df_czechia_gender_in_time['Proportion of New Female Patients [%]'] = df_czechia_gender_in_time['gender'].map(lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x)))
df_czechia_gender_in_time['Patients count'] = df_czechia_gender_in_time['gender'].map(len)
df_czechia_gender_in_time.sort_values(by=['Week Start Date', 'region_name'], inplace=True)
df_czechia_gender_in_time['map_id'] = df_czechia_gender_in_time['region_name'].map(region_name_2_map_id)

# vizualize_female_proportion_in_time

fig = px.choropleth(df_czechia_gender_in_time,
                geojson=fc,
                locations='map_id',
                animation_frame='Week Start Date',
                color_continuous_scale=px.colors.diverging.RdBu_r,
                hover_data=['region_name', 'Patients count'],
                range_color=(0, 100),
                color='Proportion of New Female Patients [%]', 
                title='Czechia: Proportion of New Female patients per regions'
               )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
In [28]:
pio.write_html(fig, file='cz_covid_what_gender_when_where.html')

The visualization enables us to see that counts of Female and Male patients differ, even for larger samples in later weeks, when there're many new COVID-19 patients during later weeks. The week started on the March 30 is especially troubling to me. In all regions there're more Females among new COVID-19-positive patients. Based on our drill-down to age groups, we know that people in 40's contribute to new cases the most. Yet, based on official demographics, there're less Females in this age category than Males. In total, there're slightly more Women (50.8%), but it cannot explain systematically higher proportions of new Female patients.

back to Content

How old, When, and Where? Average age of new patients per each region in time

Let's compute and visualize weekly mean of COVID-19 patients age in each Czech region.

In [29]:
df_czechia_age_in_time = (df_czechia
                             .groupby(['region_name', 'Week Start Date'], as_index=False)['age']
                             .agg({'Patients Count': len, 'Average Age of New Patients': np.mean})
                            )
df_czechia_age_in_time.sort_values(by=['Week Start Date', 'region_name'], inplace=True)
df_czechia_age_in_time['map_id'] = df_czechia_age_in_time['region_name'].map(region_name_2_map_id)


fig = px.choropleth(df_czechia_age_in_time,
                geojson=fc,
                locations='map_id',
                animation_frame='Week Start Date',
                color_continuous_scale='blues',
                hover_data=['region_name', 'Patients Count'],
                range_color=(0, df_czechia_age_in_time['Average Age of New Patients'].max()),
                color='Average Age of New Patients', 
                title='Czechia: Average Age of New Patients per Regions'
               )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
In [30]:
pio.write_html(fig, 'cz_how_old_when_where.html')

back to Content

COVID-19 imports drill down: from an overview down to Who, When, and Where From

Overview: What countries did we bring it from?

Let's map the country codes to full names and check overall counts of COVID-19 cases imported into Czechia from each country.

Also, for easier visualization with plotly, let's map 2-letter country codes to 3-leter country codes.

In [31]:
df_country_codes = pd.read_csv('../input/covid19-the-czech-republic-personlevel-data/country_codes', sep=';')
country_code_mapping = df_country_codes.set_index('2let')['3let']
country_code_2_name_mapping = df_country_codes.set_index('2let')['Countrylet']
df_czechia['Country Code'] = df_czechia['country_of_exposure_csu_code'].map(country_code_mapping)
In [32]:
fig, ax = plt.subplots(figsize=(14, 19)) 
vals = (df_czechia.loc[df_czechia['imported_case']==1, 'Country Code']
 .value_counts(ascending=True)
)
vals.plot(kind='barh', color=['dodgerblue'], ax=ax)
plt.title('Czechia: contries of exposure for imported COVID-19 cases', fontsize=21)

for i, v in enumerate(vals):
    ax.text(v + 1, i - 0.25, str(v), color='dodgerblue', fontweight='bold')

back to Content

Overview: When did we bring it to The Czech Republic?

In [33]:
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df_czechia['imported_case'] = df_czechia['imported_case'].fillna(0)
    _, axes = plt.subplots(2, 1, figsize=(12, 10), sharex=True)
    propotion_of_imports = (df_czechia
                            .groupby('report_date')['imported_case']
                            .mean()
                            .map(lambda x: int(100*x)))
    proportion_of_local_cases = propotion_of_imports.map(lambda x: 100 - x)
    axes[1].bar(propotion_of_imports.index, propotion_of_imports, color='red', label='Imported cases')
    axes[1].bar(propotion_of_imports.index, proportion_of_local_cases, color='brown', bottom=propotion_of_imports, label='Local cases')
    axes[1].set_ylabel('Proportion out of all cases [%]')
    axes[1].legend()

    (df_czechia
     .groupby('report_date')['imported_case']
     .sum()).plot(ax=axes[0], linewidth=3, marker='o')
    axes[0].set_ylabel('Number of imported COVID-19 cases')
    axes[1].set_xlabel('Date')
    plt.xticks(rotation=30)
    plt.suptitle('Dynamics of Imported COVID-19 cases', fontsize=24)

back to Content

Where from and When? Counts of Czech patients per countries of exposure, In time

Computing and visualizing cumulative counts for each date.

In [34]:
df_czechia_imports_in_time = (df_czechia[df_czechia['imported_case']==1]
                              .groupby(['Country Code', 'report_date'])['age'].count()
                              .groupby(level=0)
                              .cumsum())

df_czechia_imports_in_time_wide = df_czechia_imports_in_time.unstack(level=0)
df_czechia_imports_in_time_wide = df_czechia_imports_in_time_wide.reindex(all_dates)
df_czechia_imports_in_time_wide.fillna(method='ffill', inplace=True)

df_czechia_imports_in_time = df_czechia_imports_in_time_wide.stack().reset_index().rename(columns={'level_0': 'report_date', 0: 'Total imported COVID-19 cases'})
df_czechia_imports_in_time['Date'] = df_czechia_imports_in_time['report_date'].map(lambda x: x.strftime('%Y-%m-%d'))
In [35]:
fig = px.choropleth(df_czechia_imports_in_time,
                    locations='Country Code',
                    animation_frame='Date',
                    hover_name='Country Code',
                    range_color=(0, df_czechia_imports_in_time['Total imported COVID-19 cases'].max()),
                    color='Total imported COVID-19 cases', 
                    title='Czechia: Cumulative Count of COVID-19 Imports per Country of Exposure'
                   )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
In [36]:
pio.write_html(fig, file='cz_covid_where_from_when_cumulative.html')

Computing weekly counts of new imported COVID-19 cases per exposure contry.

In [37]:
df_czechia_weekly_imports_in_time = (df_czechia[df_czechia['imported_case']==1]
                                     .groupby(['Country Code', 'Week Start Date'], as_index=False)['age']
                                     .count()
                                     .rename(columns={'age':'New imported COVID-19 cases'})
                                     )

df_czechia_weekly_imports_in_time.sort_values(by=['Week Start Date', 'Country Code'], inplace=True)
In [38]:
fig = px.choropleth(df_czechia_weekly_imports_in_time,
                    locations='Country Code',
                    animation_frame='Week Start Date',
                    hover_name='Country Code',
                    range_color=(0, df_czechia_weekly_imports_in_time['New imported COVID-19 cases'].max()),
                    color='New imported COVID-19 cases', 
                    title='Czechia: Weekly Imported COVID-19 cases per Country of Exposure'
                   )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
In [39]:
pio.write_html(fig, file='cz_covid_where_from_when_weekly.html')

back to Content

What Gender, and When? Proportion of Females among Patients Importing COVID-19, in Time

In [40]:
df_czechia_weekly_imports_gender = (df_czechia[df_czechia['imported_case']==1]
                                    .groupby('Week Start Date')['gender']
                                    .agg(list))

females_proportions = df_czechia_weekly_imports_gender.map(lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x)))
males_proportions = df_czechia_weekly_imports_gender.map(lambda x: int(100*sum([gender == 'Male' for gender in x])/len(x)))
plt.figure(figsize=(12, 7))

xticks = list(range(len(females_proportions)))
plt.bar(xticks, males_proportions, color='tab:blue', edgecolor='white', label='Proportion of New Male Patients [%]')
plt.bar(xticks, females_proportions, bottom=males_proportions, 
        color='tab:pink', edgecolor='white', label='Proportion of New Female Patients [%]')
plt.plot([-.3] + xticks + [len(females_proportions)-.7], 50*np.ones(len(females_proportions) + 2), 
         linestyle='--', linewidth=3, c='red', label='50% line')

plt.xticks(xticks, females_proportions.index, rotation=30)
plt.xlabel('Week Start Date', fontsize=18)
plt.ylabel('[%]')
plt.legend(loc='upper left', bbox_to_anchor=(1,1))
plt.title('Czechia, Weekly gender proportions among patients importing COVID-19', fontsize=21)
Out[40]:
Text(0.5, 1.0, 'Czechia, Weekly gender proportions among patients importing COVID-19')
In [41]:
plt.figure(figsize=(12,7))
ax = sns.countplot(x='Week Start Date', hue='gender', data=df_czechia[df_czechia['imported_case']==1])
plt.xticks(rotation=30)
plt.legend(title='Gender')
plt.ylabel('Weekly Count', fontsize=18)
plt.title('Czechia, new imported cases per gender')
plt.show()

back to Content

Where from, What Gender, and When? Proportion of Females among Patients Importing COVID-19, per Country of Exposure, in Time

In [42]:
df_czechia_weekly_imports_gender_in_time = (df_czechia[df_czechia['imported_case']==1]
                                            .groupby(['Country Code', 'Week Start Date'], as_index=False)['gender']
                                            .agg(list)
                                     )

df_czechia_weekly_imports_gender_in_time['Proportion of New Female Patients [%]'] = df_czechia_weekly_imports_gender_in_time['gender'].map(lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x)))
df_czechia_weekly_imports_gender_in_time['Patients Count'] = df_czechia_weekly_imports_gender_in_time['gender'].map(len)
df_czechia_weekly_imports_gender_in_time.sort_values(by=['Week Start Date', 'Country Code'], inplace=True)
\
fig = px.choropleth(df_czechia_weekly_imports_gender_in_time,
                    locations='Country Code',
                    animation_frame='Week Start Date',
                    color_continuous_scale=px.colors.diverging.RdBu_r,
                    hover_data=['Country Code', 'Patients Count'],
                    range_color=(0, 100),
                    color='Proportion of New Female Patients [%]', 
                    title='Czechia: Proportion of Females among Patients Importing COVID-19, per Country of Exposure'
                   )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()
In [43]:
pio.write_html(fig, file='cz_covid_where_from_what_gender_when_weekly.html')

Italy: what Czech regions COVID-19 was imported to?

In [44]:
df_czechia_italy_in_time = (df_czechia[df_czechia['Country Code'] == 'ITA']
                             .groupby(['region_name', 'Week Start Date'], as_index=False)['gender']
                             .agg({'Count': 'count', 'Percentage of Females':lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x))})
                            )
df_czechia_italy_in_time.sort_values(by=['Week Start Date', 'region_name'], inplace=True)
df_czechia_italy_in_time['map_id'] = df_czechia_italy_in_time['region_name'].map(region_name_2_map_id)


fig = px.choropleth(df_czechia_italy_in_time,
                geojson=fc,
                locations='map_id',
                animation_frame='Week Start Date',
                color_continuous_scale=px.colors.sequential.amp,
                hover_data=['region_name', 'Percentage of Females'],
                range_color=(0, df_czechia_italy_in_time['Count'].max()),
                color='Count', 
                title='COVID-19 imports from Italy: into which regions?'
               )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()

Austria: what Czech regions COVID-19 was imported to?

In [45]:
df_czechia_aut_in_time = (df_czechia[df_czechia['Country Code'] == 'AUT']
                             .groupby(['region_name', 'Week Start Date'], as_index=False)['gender']
                             .agg({'Count': 'count', 'Percentage of Females':lambda x: int(100*sum([gender == 'Female' for gender in x])/len(x))})
                            )
df_czechia_aut_in_time.sort_values(by=['Week Start Date', 'region_name'], inplace=True)
df_czechia_aut_in_time['map_id'] = df_czechia_aut_in_time['region_name'].map(region_name_2_map_id)


fig = px.choropleth(df_czechia_aut_in_time,
                geojson=fc,
                locations='map_id',
                animation_frame='Week Start Date',
                color_continuous_scale=px.colors.sequential.amp,
                hover_data=['region_name', 'Percentage of Females'],
                range_color=(0, df_czechia_aut_in_time['Count'].max()),
                color='Count', 
                title='COVID-19 imports from Austria: into which regions?'
               )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_geos(projection_type="orthographic")
fig.update_layout(height=600, width=700, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
fig.show()

back to Content

What Age? Number of imported cases per age

In [46]:
def plot_imported_cases_per_age(gender):
    """
    The function computes and plots number of imported cases per each age group of the requested gender

    @param gender: a string specifying gender (valid values are 'Female'/'Male')
    """
    df_czechia['age_group_'] = pd.cut(df_czechia['age'], 
                                 bins = [0, 12] + list(range(17, 81, 5)) + [110],
                                 right=False
                                )
    plt.figure(figsize=(9,5))
    (df_czechia[df_czechia['gender'] == gender]
     .groupby('age_group_')['imported_case']
     .sum()
    ).plot(kind='bar')
    plt.xlabel('Age group [years]')
    plt.xticks(rotation=50)
    plt.ylabel('Number of imported cases')
    plt.ylim((0, df_czechia
                  .groupby(['gender', 'age_group_'])['imported_case']
                  .sum()
                  .max()*1.1))
    df_czechia.drop('age_group_', axis=1, inplace=True)
    plt.title(f'Czechia, {gender}s: Number of imported cases per age', fontsize=18)
In [47]:
plot_imported_cases_per_age('Female')
plot_imported_cases_per_age('Male')

back to Content

Where from, How old, and When? Average Age of Patients Importing COVID-19, per Country of Exposure, in Time

In [48]:
df_czechia_weekly_imports_age_in_time = (df_czechia
                                         .groupby(['country_of_exposure_csu_code', 'Week Start Date'], as_index=False)['age']
                                         .agg({'Patients Count': len, 'Average Age of New imported COVID-19 cases': np.mean})
                                     )


df_czechia_weekly_imports_age_in_time.sort_values(by=['Week Start Date', 'country_of_exposure_csu_code'], inplace=True)
df_czechia_weekly_imports_age_in_time['Country Code'] = df_czechia_weekly_imports_age_in_time['country_of_exposure_csu_code'].map(country_code_mapping)
fig = px.choropleth(df_czechia_weekly_imports_age_in_time,
                    locations='Country Code',
                    animation_frame='Week Start Date',
                    color_continuous_scale='blues',
                    hover_data=['Country Code', 'Patients Count'],
                    range_color=(0, df_czechia_weekly_imports_age_in_time['Average Age of New imported COVID-19 cases'].max()),
                    color='Average Age of New imported COVID-19 cases', 
                    title='Czechia: Average Age of Patients Importing COVID-19, per Country of Exposure'
                   )

fig.update_geos(fitbounds="locations", visible=True)
fig.update_layout(height=600, margin={"r":0,"t":30,"l":0,"b":0, 'pad': 2})
pio.write_html(fig, file='cz_covid_where_from_how_old_when_weekly.html')
fig.show()

back to Content

What Age, What Gender, and Where from? Counts of new imported cases per patient gender and age group, per country of exposure

In [49]:
df_czechia['Country Name'] = df_czechia['country_of_exposure_csu_code'].map(country_code_2_name_mapping)
plot_summarization_heatmap('Female', 'Czech Females: Counts of Imported Cases Per Country', 
                           bool_index=df_czechia['imported_case']==1, second_dim_col='Country Name',
                           fig_width=500, fig_height=700, 
                           upload_fig_filename='cz_imported_females_age_country' if export_plotly_figs_online else None
                          )
<Figure size 1008x288 with 0 Axes>
In [50]:
plot_summarization_heatmap('Male', 'Czech Males: Counts of Imported Cases Per Country', 
                           bool_index=df_czechia['imported_case']==1, second_dim_col='Country Name',
                           fig_width=500, fig_height=700, 
                           upload_fig_filename='cz_imported_males_age_country' if export_plotly_figs_online else None
                          )
<Figure size 1008x288 with 0 Axes>

back to Content

What Age, What Gender, and When? Daily counts of new imported cases per patient gender and age group

In [51]:
plot_summarization_heatmap('Female', 'Czechia, Females: Daily Counts of New Imported Cases', 
                           bool_index=df_czechia['imported_case']==1,                           
                           upload_fig_filename='cz_imported_females_age_date' if export_plotly_figs_online else None)
<Figure size 1008x288 with 0 Axes>
In [52]:
plot_summarization_heatmap('Male', 'Czechia, Males: Daily Counts of New Imported Cases', 
                           bool_index=df_czechia['imported_case']==1,                           
                           upload_fig_filename='cz_imported_males_age_date' if export_plotly_figs_online else None)
<Figure size 1008x288 with 0 Axes>

Interesting to see, that young men were importing COVID-19 as a leading age groupy mainly in the beginning, while young women are consistently importing COVID-19 as one of the most active female age groups.

Final data prep

Before moving on to Canada data, let's compute COVID-19 age-gender group proportions and reorganize data into long format, so that it'd be more straightforward to include general population proportions later on.

In [53]:
counts_1 = df_czechia.loc[df_czechia['gender'] == 'Female', 'age_group'].value_counts()
counts_2 = df_czechia.loc[df_czechia['gender'] == 'Male', 'age_group'].value_counts()
sample_size_cz = np.sum(counts_1) + np.sum(counts_2)
counts_1 /= 0.01*sample_size_cz
counts_2 /= 0.01*sample_size_cz
y_max = 1.1 * max(max(counts_1), max(counts_2))
        
all_counts = pd.DataFrame({'Female': counts_1, 'Male': counts_2})
df_proportions_czechia = (all_counts
                          .stack()
                          .reset_index()
                          .rename(columns={'level_0':'age_group', 'level_1':'gender', 0: 'Sample proportion [%]'}))
df_proportions_czechia['age_group'] = df_proportions_czechia['age_group'].astype(str)

back to Content

EDA: Canada COVID-19 cases

We'll use the Roche UNCOVER Challenge data about Canada.

In [54]:
#Let's start with the `covid_19_canada_open_data_working_group` dataset.
df_canada_open = pd.read_csv('../input/uncover/UNCOVER/covid_19_canada_open_data_working_group/public-covid-19-cases-canada.csv')
In [55]:
df_canada_open.shape
Out[55]:
(17167, 12)

Available age and gender info

In [56]:
df_canada_open['age'].value_counts().plot(kind='barh', color=['dodgerblue'])
plt.title('Available Age Info in the Canada Open Data dataset')
Out[56]:
Text(0.5, 1.0, 'Available Age Info in the Canada Open Data dataset')
In [57]:
# df_canada_tracker = pd.read_csv('../input/uncover/covid_tracker_canada/covid-19-tracker-canada.csv')
# print(f'Canada tracker dataset shape: {df_canada_tracker.shape}')
# df_canada_tracker['age'].value_counts().plot(kind='barh')
# plt.title('Available age info in the Canada COVID-19 Tracker Dataset')

Unfortunately, the same hold for covid_tracker_canada dataset: for the majority of records age is missing. COVID-19 tracker for Canada seems to provide similar info, but without gender info and it has slighty less records. So, let's use the covid_19_canada_open_data_working_group dataset.

Checking and fixing age data quality

In [58]:
print(f"Count of values in each age category: \n{df_canada_open['age'].value_counts()[1:]}")
Count of values in each age category: 
50-59    232
60-69    201
30-39    178
40-49    164
20-29    154
70-79     99
80-89     42
90-99     20
<20       17
<18        6
<10        5
10-19      2
50         1
2          1
<1         1
61         1
Name: age, dtype: int64

Given the counts, let's create a category 0-19, and unify all age records.

In [59]:
def fix_age_val(bool_idx, new_val):
    df_canada_open.loc[bool_idx, 'age'] = new_val
    
fix_age_val(df_canada_open['age'].isin({'<18', '<1', '2', '<10', '<20', '10-19'}), '0-19')
fix_age_val(df_canada_open['age'] == '61', '60-69')
fix_age_val(df_canada_open['age'] == '50', '50-59')
In [60]:
print(f"Count of values in each age category: \n{df_canada_open['age'].value_counts()[1:]}")
Count of values in each age category: 
50-59    233
60-69    202
30-39    178
40-49    164
20-29    154
70-79     99
80-89     42
0-19      32
90-99     20
Name: age, dtype: int64
In [61]:
df_canada_open.loc[df_canada_open['age'] != 'Not Reported', 'sex'].value_counts().plot(kind='barh', color=['dodgerblue'])
plt.title('Available gender info for known age in the Open Data')
Out[61]:
Text(0.5, 1.0, 'Available gender info for known age in the Open Data')

For the majority of cases with reported age, gender is known as well. It is good news as it would enable us to conduct the detailed analysis.

Total number of datapoints with known gender and sex:

In [62]:
sample_size_canada = sum((df_canada_open['age'] != 'Not Reported') & (df_canada_open['sex'] != 'Not Reported'))
print(f'Number of cases with known age and sex: {sample_size_canada}')
Number of cases with known age and sex: 978

Thankfully, we have at least something available to work with.

In [63]:
female_bool_idx = (df_canada_open['age'] != 'Not Reported') & (df_canada_open['sex'] == 'Female')
male_bool_idx = (df_canada_open['age'] != 'Not Reported') & (df_canada_open['sex'] == 'Male')
min(df_canada_open.loc[female_bool_idx, 'age'].value_counts().min(),
    df_canada_open.loc[male_bool_idx, 'age'].value_counts().min())
Out[63]:
6
In [64]:
female_age_group_sizes = df_canada_open.loc[female_bool_idx, 'age'].value_counts(dropna=False)
print(f"Number of persons in small age categories for women: \n{female_age_group_sizes[female_age_group_sizes<10]}")
male_age_group_sizes = df_canada_open.loc[male_bool_idx, 'age'].value_counts(dropna=False)
print(f"\n\nNumber of persons in small age categories for men: \n{male_age_group_sizes[male_age_group_sizes<10]}")
Number of persons in small age categories for women: 
Series([], Name: age, dtype: int64)


Number of persons in small age categories for men: 
90-99    6
Name: age, dtype: int64

The age group 90-99 will have to be joined for men and women, to ensure more than 10 representatives for each group.

For other groups we can conduct analysis separately for men and women.

back to Content

Age distributions of COVID-19 cases per gender

In [65]:
joined_age_groups = ['90-99']

gender_group_list = []
age_group_list = []
group_size_list = []
numeric_age_groups = sorted([group_name for group_name in df_canada_open['age'].unique() if group_name != 'Not Reported'])

for age_group in numeric_age_groups:
    female_count = female_age_group_sizes[age_group]
    male_count = male_age_group_sizes[age_group]
    if age_group in joined_age_groups:
        gender_group_list.append('both-genders')
        joined_count = female_count + male_count
        group_size_list.append(joined_count)
        age_group_list.append(age_group)
    else:
        gender_group_list.extend(['Females', 'Males'])
        age_group_list.extend([age_group, age_group])
        group_size_list.extend([female_count, male_count])
        
df_proportions_canada = pd.DataFrame({'gender': gender_group_list, 'age_group': age_group_list, 
                                       'Sample proportion [%]': [100*group_size/sample_size_canada for group_size in group_size_list]})

pd.pivot_table(df_proportions_canada, index='age_group', columns=['gender'], values='Sample proportion [%]').plot.bar(rot=90, figsize=(12, 8))
plt.title('Canada: Age distributions of COVID-19 cases per gender', fontsize=18)
plt.ylabel('Proportion [%]')
Out[65]:
Text(0, 0.5, 'Proportion [%]')

back to Content

Does COVID-19 attack all age-gender groups evenly?

We're going to formulate a null hypothesis that COVID-19 attacks any person at random, not taking age-gender grouping into account.

We'll see that the null hypothesis will be rejected with a very low type I error. It's formal statistical evidence that there are age-gender population groups with significantly higher risk of contracting COVID-19.

The COVID-19 news can easily bring one down. Yet, to stay effective in this fight agains corona, let's occasionally lift our spirits with some stats jokes

In [66]:
%%HTML
<img src=https://i.pinimg.com/originals/a1/ee/ff/a1eeffe12fd3db681208ad5f0387905c.png width="150" align="left">

The Czech Republic

General population data

I've downloaded the general population data for Czechia and Canada, and put them into public datasets.

In [67]:
def get_czech_population_gender_all(path_to_group_file):
    """
    A function for loading The Czech Republic Age Gender Demographics 2018 input file.

    @param path_to_group_file: A path to input file. In the dataset each gender info is stored in a separate file
    @return: a pandas DataFrame with columns 'Age', 'VALUE'
    """
    czech_population_group = pd.read_csv(path_to_group_file, low_memory=False)
    czech_population_group.loc[czech_population_group['Age'] == '100+', 'Age'] = 100
    czech_population_group['Age'] = czech_population_group['Age'].map(int)
    czech_population_group['CZ0\nČR'] = czech_population_group['CZ0\nČR'].map(lambda x: int(x.replace(',','')))
    return czech_population_group[['Age', 'CZ0\nČR']].rename(columns={'CZ0\nČR': 'VALUE'})

czech_population_female = get_czech_population_gender_all('../input/the-czech-republic-age-gender-demographics-2018/CZ_demographics_women_2018.csv')
czech_population_male = get_czech_population_gender_all('../input/the-czech-republic-age-gender-demographics-2018/CZ_demographics_men_2018.csv')

Let's compute general population proportions per age groups we've introduced for COVID-19 patients.

In [68]:
def derive_age_category_cz(df, age_start, age_final):
    """
    The function computes counts in the general Czech population proportions per age groups found in data on Czech COVID-19 patients.
    The computed record is appended to the general population DataFrame.
    
    @param df: The general population pandas DataFrame
    @param age_start: lower bound of the new age group interval (included)
    @param age_final: upper bound of the new age group interval (excluded)
    @return: the general population pandas DataFrame updated with the newly computed category
    """
    row_val = 0
    i = 0 
    while age_start + i < age_final:
        if np.any(df['Age'] == age_start + i):
            row_val += df.loc[df['Age'] == age_start + i, 'VALUE'].values[0]
        i += 1
    row_df = pd.DataFrame({'gender': [df['gender'].values[0]], 'age_group': [f'[{age_start}, {age_final})'], 'VALUE': [row_val]})
    return df.append(row_df, ignore_index=True)


czech_age_groups = df_czechia['age_group'].map(str).unique()
czech_population_male['gender'] = 'Male'
czech_population_female['gender'] = 'Female'
for age_group_name in tqdm(czech_age_groups, desc='Deriving age groups..'):
    czech_population_female = derive_age_category_cz(czech_population_female, *map(int, age_group_name[1:-1].split(', ')))
    czech_population_male = derive_age_category_cz(czech_population_male, *map(int, age_group_name[1:-1].split(', ')))
czech_population_joined = pd.concat((czech_population_female, czech_population_male), sort=False)
the_whole_population_size = czech_population_joined.loc[czech_population_joined['Age'] == -1, 'VALUE'].sum()

df_proportions_czechia['True population proportion [%]'] = np.nan
for index, row in df_proportions_czechia.iterrows():
    true_population_count = czech_population_joined.loc[(czech_population_joined['age_group'] == row['age_group']) &
                                                         (czech_population_joined['gender'] == row['gender']) , 'VALUE'].values[0]
    df_proportions_czechia.loc[index, 'True population proportion [%]'] = 100 * true_population_count / the_whole_population_size
C:\Users\H184146\AppData\Local\conda\conda\envs\ds2\lib\site-packages\pandas\core\frame.py:6692: FutureWarning:

Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.



back to Content

COVID-19 age-group proportions vs. general population

In [69]:
def compare_proportions_visually(df, gender, title='', covid_col='Sample proportion [%]', return_pivoted_table=False):
    """
    The function prepare data for visual comparison of age-gender group proportions
    between COVID-19 patients and general population.
    By default the function creates a comparative plot.
    For a custom comparative visualization one can specify return_pivoted_table=True,
    in this case data for comparative visualization are returned without plotting default visualization.
    
    @param df: a pandas DataFrame with population and COVID-19 proportions, containing columns ['age_group', 'gender', <covid_col>, 'True population proportion [%]']
    @param gender: a gender visual comparison must be prepared for
    @param title: a title for default visualization
    @param covid_col: a column name with COVID-19 proportion (can 'Sample proportion [%]' or confidence interval lb)
    @param return_pivoted_table: a boolean flag to return prepared data for custom visualization
    @return: None when return_pivoted_table=False, otherwise a pandas DataFrame containing columns 'COVID-19', and 'General Population', indexed by age groups
    """
    sample_proportion = df.loc[df['gender'] == gender, 
                               ['age_group', covid_col]].rename(columns={covid_col: 'Proportion'})
    sample_proportion['Dataset'] = 'COVID-19'
    population_proportion = df.loc[df['gender'] == gender,
                                   ['age_group', 'True population proportion [%]']].rename(columns={'True population proportion [%]': 'Proportion'})
    population_proportion['Dataset'] = 'General Population'
    gender_proportions = pd.concat((sample_proportion, population_proportion), sort=False)
    gender_proportions_pivoted_table = pd.pivot_table(gender_proportions,
                                                      index='age_group', 
                                                      columns=['Dataset'], 
                                                      values='Proportion')
    if return_pivoted_table:
        return gender_proportions_pivoted_table
    gender_proportions_pivoted_table.plot.bar(rot=30, figsize=(12, 7))
    plt.xlabel('Age Group [years interval]')
    plt.ylabel('Proportion [%]')
    plt.title(title, fontsize=18)
    
compare_proportions_visually(df_proportions_czechia, 'Female', 
                             'Czechia, Females: COVID-19 proportions vs. Population proportions')
In [70]:
compare_proportions_visually(df_proportions_czechia, 'Male', 
                             'Czechia, Males: COVID-19 proportions vs. Population proportions')

An interesting observation

We can see that in Czechia we seemingly manage to protect older women such that there're less of them among COVID-19 patients compared to the population proportion. However, it doesn't hold for older men, even though the state policies are the same (like in the morning only older people are allowed to visit groceries/post office). I believe it's because Czech men are true gentlemen protecting their women whenever possible. Another possible explanation might be that older men are harder to help to compared to older women.

Note the difference in proportion of young people. As this group is an obvious outliner (either due to being more resilient, or due to being more carefully protected, or both), let's also check proportions among population older than 20.

In [71]:
def get_20_plus_proportions(df_proportions, youth_category='[0, 20)'):
    """
    A function returns age category proportions in population excluding 'youth_category'

    @param df_proportions: a pandas DataFrame with population and COVID-19 proportions, containing columns ['age_group', 'gender', <covid_col>, 'True population proportion [%]']
    @param youth_category: a category in the 'age_group' column to be excluded
    @return: a pandas DataFrame with proportions recomputed after excluding 'youth_category'
    """
    df_proportions_20plus = df_proportions[df_proportions['age_group'] != youth_category].copy()
    df_proportions_20plus['Sample proportion [%]'] = 100*df_proportions_20plus['Sample proportion [%]'] / df_proportions_20plus['Sample proportion [%]'].sum()
    df_proportions_20plus['True population proportion [%]'] = 100*df_proportions_20plus['True population proportion [%]'] / df_proportions_20plus['True population proportion [%]'].sum()
    return df_proportions_20plus

df_proportions_czechia_20plus = get_20_plus_proportions(df_proportions_czechia)
In [72]:
compare_proportions_visually(df_proportions_czechia_20plus, 'Female', 
                             'Czechia, Females over 20: COVID-19 proportions vs. Population proportions')
In [73]:
compare_proportions_visually(df_proportions_czechia_20plus, 'Male', 
                             'Czechia, Males over 20: COVID-19 proportions vs. Population proportions')

back to Content

Pearson's chi-squared test

Now we're going to test null hypothesis that COVID-19 attacks any person at random, not taking age-gender grouping into account.

A friend of mine recommended me to provide some pointers for those, who'd like to refresh theoretical basics. Here one can find nice lecture notes: from Stanford's "Introduction to Statistical Methods".

And here's a short video from Khan academy:

In [74]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/2QeDRsxSF9M?rel=0&amp;controls=1&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>

We have verified that each age category in the COVID-19 samples from both Czechia and Canada has at least 10 points, so let's proceed with the testing.

All categories

In [75]:
chisquare((df_proportions_czechia['Sample proportion [%]']*sample_size_cz/100).map(int).values,
          (df_proportions_czechia['True population proportion [%]']*sample_size_cz/100).map(int).values)
Out[75]:
Power_divergenceResult(statistic=815.9484405062602, pvalue=1.8181791251904143e-162)

Females: all ages

In [76]:
chisquare((df_proportions_czechia.loc[df_proportions_czechia['gender'] == 'Female', 'Sample proportion [%]']*sample_size_cz/100).map(int).values,
          (df_proportions_czechia.loc[df_proportions_czechia['gender'] == 'Female', 'True population proportion [%]']*sample_size_cz/100).map(int).values)
Out[76]:
Power_divergenceResult(statistic=521.4740111275132, pvalue=1.7330873818456995e-107)

Females over 20

In [77]:
chisquare((df_proportions_czechia_20plus.loc[df_proportions_czechia_20plus['gender'] == 'Female', 'Sample proportion [%]']*sample_size_cz/100).map(int).values,
          (df_proportions_czechia_20plus.loc[df_proportions_czechia_20plus['gender'] == 'Female', 'True population proportion [%]']*sample_size_cz/100).map(int).values)
Out[77]:
Power_divergenceResult(statistic=259.59309830701756, pvalue=2.512101343638041e-52)

Males: all ages

In [78]:
chisquare((df_proportions_czechia.loc[df_proportions_czechia['gender'] == 'Male', 'Sample proportion [%]']*sample_size_cz/100).map(int).values,
          (df_proportions_czechia.loc[df_proportions_czechia['gender'] == 'Male', 'True population proportion [%]']*sample_size_cz/100).map(int).values)
Out[78]:
Power_divergenceResult(statistic=294.4744293787469, pvalue=6.172628954118387e-59)

Males over 20

In [79]:
chisquare((df_proportions_czechia_20plus.loc[df_proportions_czechia_20plus['gender'] == 'Male', 'Sample proportion [%]']*sample_size_cz/100).map(int).values,
          (df_proportions_czechia_20plus.loc[df_proportions_czechia_20plus['gender'] == 'Male', 'True population proportion [%]']*sample_size_cz/100).map(int).values)
Out[79]:
Power_divergenceResult(statistic=67.85203808316513, pvalue=4.009469839319972e-12)

..Here goes another stats meme

In [80]:
%%HTML
<img src=https://i.pinimg.com/originals/e5/46/4b/e5464b76f64560a133d2827f7bb9eec5.jpg width="310" align="left">

We can see that even when focusing on females/males over 20, p-values are very low.

In our case p-value is probability of getting our result or even more extreme one (more different distributions) when COVID-19 attacks all age groups evenly/at random. Or in other words it's probability of rejecting hypothesis that COVID-19 attacks every age group evenly while in fact it's true (type I error).

Let's now separately test each age/gender group. We've already performed several tests not controlling increase in familywise error rate, considering our investigations preliminary. However, now we're going to perform 18 tests. Let's correct p-values using Holm correction.

In [81]:
df_proportions_czechia['p-value'] = np.nan
for index, row in df_proportions_czechia.iterrows():
    covid_group_frequency = df_proportions_czechia.loc[index, 'Sample proportion [%]']*sample_size_cz/100 
    covid_others_frequency = sample_size_cz - covid_group_frequency
    population_group_frequency = df_proportions_czechia.loc[index, 'True population proportion [%]']*sample_size_cz/100 
    population_others_frequency = sample_size_cz - population_group_frequency
    # check normality: at least 10 cases in each category
    if min([covid_group_frequency, covid_others_frequency, population_group_frequency, population_others_frequency]) >= 10:
        df_proportions_czechia.loc[index, 'p-value'] = chisquare([covid_group_frequency, covid_others_frequency],
                                                             [population_group_frequency, population_others_frequency])[1]
df_proportions_czechia['H0_rejected'] = False
df_proportions_czechia.loc[~df_proportions_czechia['p-value'].isnull(), 'H0_rejected'] = multipletests(df_proportions_czechia.loc[~df_proportions_czechia['p-value'].isnull(), 'p-value'])[0]
In [82]:
df_proportions_czechia['Ho_rejected_bonferroni'] = df_proportions_czechia['p-value']*sum(~df_proportions_czechia['p-value'].isnull()) < 0.05
In [83]:
df_proportions_czechia[df_proportions_czechia['H0_rejected']]
Out[83]:
age_group gender Sample proportion [%] True population proportion [%] p-value H0_rejected Ho_rejected_bonferroni
0 [0, 20) Female 4.068264 9.790967 1.062705e-48 True True
1 [0, 20) Male 4.619893 10.311445 4.199319e-46 True True
2 [20, 30) Female 8.102051 5.536522 1.292844e-17 True True
3 [20, 30) Male 7.291846 5.799536 1.157311e-06 True True
6 [40, 50) Female 11.101534 7.688659 1.720726e-22 True True
7 [40, 50) Male 10.032753 8.107357 7.757182e-08 True True
8 [50, 60) Female 8.222720 6.161259 6.585208e-11 True True
9 [50, 60) Male 7.102224 6.259883 8.085923e-03 True False
10 [60, 70) Female 5.223237 6.872339 6.874932e-07 True True
12 [70, 80) Female 3.861403 5.027071 4.840334e-05 True True
14 [80, 90) Female 2.809860 2.268640 5.633607e-03 True False
15 [80, 90) Male 1.913463 1.203792 7.182670e-07 True True
16 [90, 110) Female 1.189450 0.416878 6.689300e-20 True True

Canada population data

Preprocessing the general population data to have it in a format compatable with the COVID-19 data.

In [84]:
canada_population = pd.read_csv('../input/canada-population/17100005.csv', low_memory=False)
canada_population = canada_population.rename(columns={'Sex': 'gender', 'Age group': 'age_group'})
canada_population_female = canada_population.loc[(canada_population['REF_DATE'] == 2019) &
                                                 (canada_population['gender'] == 'Females') &
                                                 (canada_population['GEO'] == 'Canada'), ['gender', 'age_group', 'VALUE']]
canada_population_male = canada_population.loc[(canada_population['REF_DATE'] == 2019) &
                                               (canada_population['gender'] == 'Males') &
                                               (canada_population['GEO'] == 'Canada'), ['gender', 'age_group', 'VALUE']]

def derive_age_category_canada(df, age_start, age_final):
    """
    The function computes counts in the general Canada population proportions per age groups found in data on Canada COVID-19 patients.
    The computed record is appended to the general population DataFrame.
    
    @param df: The general population pandas DataFrame
    @param age_start: lower bound of the new age group interval (included)
    @param age_final: upper bound of the new age group interval (excluded)
    @return: the general population pandas DataFrame updated with the newly computed category
    """
    row_val = 0
    i = 0
    while age_start + i*5 < age_final:
        row_val += df.loc[df['age_group'] == f'{age_start + i*5} to {age_start + i*5 + 4} years', 'VALUE'].values[0]
        i += 1
    row_df = pd.DataFrame({'gender': [df['gender'].values[0]], 'age_group': [f'{age_start}-{age_final}'], 'VALUE': [row_val]})
    return df.append(row_df, ignore_index=True)


for age_group_name in tqdm(numeric_age_groups, desc='Unifying age groups..'):
    canada_population_female = derive_age_category_canada(canada_population_female, *map(int, age_group_name.split('-')))
    canada_population_male = derive_age_category_canada(canada_population_male, *map(int, age_group_name.split('-')))
canada_population_joined = pd.concat((canada_population_female, canada_population_male))
the_whole_population_size = canada_population_joined.loc[canada_population_joined['age_group'] == 'All ages', 'VALUE'].sum()

canada_population_joined['True population proportion [%]'] = np.nan
for index, row in df_proportions_canada.iterrows():
    if row['gender'] == 'both-genders':
        true_population_count = canada_population_joined.loc[canada_population_joined['age_group'] == row['age_group'], 'VALUE'].sum()
    else:
        true_population_count = canada_population_joined.loc[(canada_population_joined['age_group'] == row['age_group']) &
                                                             (canada_population_joined['gender'] == row['gender']) , 'VALUE'].values[0]
    df_proportions_canada.loc[index, 'True population proportion [%]'] = 100 * true_population_count / the_whole_population_size

back to Content

COVID-19 age-group proportions vs. general population*

In [85]:
compare_proportions_visually(df_proportions_canada, 'Females', 
                             'Canada, Females: COVID-19 proportions vs. Population proportions')
In [86]:
compare_proportions_visually(df_proportions_canada, 'Males', 
                             'Canada, Males: COVID-19 proportions vs. Population proportions')
In [87]:
df_proportions_canada_20plus = get_20_plus_proportions(df_proportions_canada, youth_category='0-19')
compare_proportions_visually(df_proportions_canada_20plus, 'Females', 
                             'Canada, Females over 20: COVID-19 proportions vs. Population proportions')
In [88]:
compare_proportions_visually(df_proportions_canada_20plus, 'Males', 
                             'Canada, Males over 20: COVID-19 proportions vs. Population proportions')

back to Content

Pearson's chi-squared test*

Females: all ages

In [89]:
chisquare((df_proportions_canada.loc[df_proportions_canada['gender'] == 'Females', 'Sample proportion [%]']*sample_size_canada/100).map(int).values,
          (df_proportions_canada.loc[df_proportions_canada['gender'] == 'Females', 'True population proportion [%]']*sample_size_canada/100).map(int).values)
Out[89]:
Power_divergenceResult(statistic=125.63797157184862, pvalue=5.1175320779465806e-24)

Females over 20

In [90]:
chisquare((df_proportions_canada_20plus.loc[df_proportions_canada_20plus['gender'] == 'Females', 'Sample proportion [%]']*sample_size_canada/100).map(int).values,
          (df_proportions_canada_20plus.loc[df_proportions_canada_20plus['gender'] == 'Females', 'True population proportion [%]']*sample_size_canada/100).map(int).values)
Out[90]:
Power_divergenceResult(statistic=18.386767701284267, pvalue=0.005334953320274743)

Males: all ages

In [91]:
chisquare((df_proportions_canada.loc[df_proportions_canada['gender'] == 'Males', 'Sample proportion [%]']*sample_size_canada/100).map(int).values,
          (df_proportions_canada.loc[df_proportions_canada['gender'] == 'Males', 'True population proportion [%]']*sample_size_canada/100).map(int).values)
Out[91]:
Power_divergenceResult(statistic=133.16272857955295, pvalue=1.37167491732604e-25)

Males over 20

In [92]:
chisquare((df_proportions_canada_20plus.loc[df_proportions_canada_20plus['gender'] == 'Males', 'Sample proportion [%]']*sample_size_canada/100).map(int).values,
          (df_proportions_canada_20plus.loc[df_proportions_canada_20plus['gender'] == 'Males', 'True population proportion [%]']*sample_size_canada/100).map(int).values)
Out[92]:
Power_divergenceResult(statistic=20.637538607501963, pvalue=0.0021308878009418785)

We obtain the same conclusion for Canada, rejecting the null hypothesis about random age groups targeting by COVID-19 with significance levels under 0.005.

Let's now examine each age group separately.

back to Content

Groups at risk of contracting COVID-19 based on data

Let's illustrate how I'm going to do it on one concrete example. My mother belongs to Women between 50-59, I'll use this group for running example. Steps for quantative estimation:

  1. We estimate population proportion of Women between 50-59 based on COVID-19 data. Let's say, 99% confidence interval for the group proportion is 9.1-16.9%. It's an expected population proportion interval if the virus attacks this particular age/gender group at random.

  2. Then we compute actual proportion of women between 50-59 in the population. Let's say, the actual proportion to be 7.0%.

  3. In the running example, the population proportion is less than the COVID-19 proportion by at least 2.1%.

... and here goes a meme related to confidence intervals..

In [93]:
%%HTML
<img src=https://i.imgflip.com/uzpm3.jpg width="210" align="left">

As a reference refresher about the conditions which should be met before estimating confidence intervals for proportions, please, feel free to check this link, or the following video from Khan academy:

In [94]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/8E8YQY5qE3s?rel=0&amp;controls=1&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>

To sum up, to have valid estimation of population proportion CI, the following must be true:

  1. Sample is random
  2. Sampling distribution of proportions has roughly normal shape
    • rule of thumb to check: we have at least 10 points in each category. Which is true in our case.
  3. Cases are independent
    • rule of thumb to check: sample size is at most 10% of the whole population. Which is also true for both Czechia and Canada.

We needed to ensure points 2. and 3., because non-randomeness must be the only factor which can break our proportion estimates. Then if based on positive COVID-19 cases we obtain an estimate for population proportion of women between 50-59 that is higher than the general population proportion, then obviously the virus picks women between 50-59 more often than at random.

back to Content

Estimating confidence intervals

Let's estimate 99% confidence intervals for each age-sex proportion.

Let's compute both simultaneous confidence interval estimations for multinomial proportions, and confidence intervals for binomial proportions, e.g. group female50_59 vs. all other groups.

In [95]:
def estimate_proportions_conf_intervals(df_proportions, sample_size):
    """
    A function estimates confidence intervals for population proportion of each age-gender category based on COVID-19 data
    Both multinomial and binomial proportion CI's are being computed

    @param df_proportions: a pandas DataFrame with population and COVID-19 proportions, containing columns ['age_group', 'gender', <covid_col>, 'True population proportion [%]']
    @param sample_size: number of all COVID-19 patients with age-gender records
    @return: a pandas DataFrame with proportions updated with confidence interval columns
    """
    binomial_population_proportion_lb_list = []
    binomial_population_proportion_ub_list = []
    age_gender_group_sizes = []

    for index, row in df_proportions.iterrows():
        age_gender_group_size = row['Sample proportion [%]'] * sample_size/100
        age_gender_group_sizes.append(age_gender_group_size)
        conf_interval_lb, conf_interval_ub = proportion_confint(age_gender_group_size, sample_size, alpha=0.01)
        binomial_population_proportion_lb_list.append(100*conf_interval_lb)
        binomial_population_proportion_ub_list.append(100*conf_interval_ub)

    conf_intervals = multinomial_proportions_confint(age_gender_group_sizes, alpha=0.01)
    population_proportion_lb_list = list(100*conf_intervals[:, 0])
    population_proportion_ub_list = list(100*conf_intervals[:, 1])
    df_proportions['Population proportion CI LB [%]'] = population_proportion_lb_list
    df_proportions['Population proportion CI UB [%]'] = population_proportion_ub_list
    df_proportions['Binomial population proportion CI LB [%]'] = binomial_population_proportion_lb_list
    df_proportions['Binomial population proportion CI UB [%]'] = binomial_population_proportion_ub_list
    return df_proportions

df_proportions_canada = estimate_proportions_conf_intervals(df_proportions_canada, sample_size_canada)
df_proportions_czechia = estimate_proportions_conf_intervals(df_proportions_czechia, sample_size_cz)

back to Content

Comparison of proportions between COVID-19 cases and general population

The lower the difference between the COVID-19 proportion and the general population proportion, the lower the risk of contracting COVID-19. These cases are depicted with green bars.

On the other hand, the higher the difference, the higher the risk of contracting COVID-19. It's depicted in red.

Females: Visual comparison per age groups

In [96]:
def plot_increase_in_covid_data(gender_proportions_pivoted_table, title):
    """
    Visualization of the proportion increase in COVID-19 data vs. general population data, separately per each age group

    @param gender_proportions_pivoted_table: pandas DataFrame with precomputed data for visual comparison between population and COVID-19 proportions, 
                                            the DataFrame contains columns 'COVID-19', and 'General Population', indexed by age groups
    @param title: title of visualization
    """
    gender_proportions_pivoted_table['Increase in COVID-19 proportion'] = gender_proportions_pivoted_table['COVID-19'] - gender_proportions_pivoted_table['General Population']
    gender_proportions_pivoted_table['Increase in COVID-19 proportion'] = gender_proportions_pivoted_table['Increase in COVID-19 proportion'].map(lambda x: 0 if x<=0 else x)

    plt.figure(figsize=(12, 7))

    xticks = list(range(len(gender_proportions_pivoted_table)))
    plt.bar(xticks, gender_proportions_pivoted_table['General Population'], color='darkorange', edgecolor='white', label='General Population')
    plt.bar(xticks, gender_proportions_pivoted_table['Increase in COVID-19 proportion'], bottom=gender_proportions_pivoted_table['General Population'], 
            color='darkred', edgecolor='white', label='Increase in COVID-19 proportion\n(HIGHER EXPOSURE TO COVID-19)')

    plt.xticks(xticks, gender_proportions_pivoted_table.index)
    plt.xlabel('Age Group')

    plt.legend(loc='upper left', bbox_to_anchor=(1,1))
    plt.title(f'{title}', fontsize=18)

plot_increase_in_covid_data(compare_proportions_visually(df_proportions_czechia, 'Female', return_pivoted_table=True), 
                            title='Czechia, Females: proportion increase in COVID-19 data')

plot_increase_in_covid_data(compare_proportions_visually(df_proportions_canada, 'Females', return_pivoted_table=True), 
                            title='Canada, Females: proportion increase in COVID-19 data')

back to Content

Males: Visual comparison per age groups

In [97]:
plot_increase_in_covid_data(compare_proportions_visually(df_proportions_czechia, 'Male',
                                                         return_pivoted_table=True), 
                            title='Czechia, Males: proportion increase in COVID-19 data')

plot_increase_in_covid_data(compare_proportions_visually(df_proportions_canada, 'Males', 
                                                         return_pivoted_table=True), 
                            title='Canada, Males: proportion increase in COVID-19 data')

back to Content

Summary: comparison for all gender/sex groups

Here we also visualize in a column cases when COVID-19 proportion is lower than population proportion. Such 'safer' groups have green color.

In [98]:
def display_tabular_diffs_cz(col_name='Sample proportion [%]', col_name_2=None, attribute_name='Czechia, Difference between COVID-19 cases proportion and Population proportion [%]'):
    if col_name_2 is None:
        df_proportions_czechia[attribute_name] = (df_proportions_czechia[col_name] - 
                                              df_proportions_czechia['True population proportion [%]'])
    else:
        proportion_diffs = df_proportions_czechia['Sample proportion [%]'] - df_proportions_czechia['True population proportion [%]']
        df_proportions_czechia[attribute_name] = np.nan
        lb_diff = df_proportions_czechia[col_name] - df_proportions_czechia['True population proportion [%]']
        df_proportions_czechia.loc[proportion_diffs > 0, attribute_name] = lb_diff[proportion_diffs > 0].map(lambda x: max(0, x))        
        ub_diff = df_proportions_czechia[col_name_2] - df_proportions_czechia['True population proportion [%]']
        df_proportions_czechia.loc[proportion_diffs < 0, attribute_name] = ub_diff[proportion_diffs < 0].map(lambda x: min(0, x)) 
    display((df_proportions_czechia[['gender', 'age_group', attribute_name]]
     .sort_values(by=['gender', 'age_group'])
     .style.bar(subset=[attribute_name], align='mid', color=['#5fba7d', '#d65f5f'])
     .set_properties(subset=[attribute_name], **{'width': '490px'})))

    df_proportions_czechia.loc[~df_proportions_czechia['H0_rejected'], attribute_name] = 0
    prepated_data = (df_proportions_czechia.loc[(df_proportions_czechia['gender'] == 'Male'), ['gender', 'age_group', attribute_name]]
                     .sort_values(by=['gender', 'age_group']))
    display(prepated_data.style.bar(subset=[attribute_name], align='mid', color=['#5fba7d', '#d65f5f']).set_properties(subset=[attribute_name], **{'width': '520px'}))

    df_proportions_czechia.loc[~df_proportions_czechia['H0_rejected'], attribute_name] = 0
    prepated_data = (df_proportions_czechia.loc[(df_proportions_czechia['gender'] == 'Female'), ['gender', 'age_group', attribute_name]]
                     .sort_values(by=['gender', 'age_group']))
    display(prepated_data.style.bar(subset=[attribute_name], align='mid', color=['#5fba7d', '#d65f5f']).set_properties(subset=[attribute_name], **{'width': '520px'}))
display_tabular_diffs_cz()
gender age_group Czechia, Difference between COVID-19 cases proportion and Population proportion [%]
0 Female [0, 20) -5.7227
2 Female [20, 30) 2.56553
4 Female [30, 40) -0.266325
6 Female [40, 50) 3.41287
8 Female [50, 60) 2.06146
10 Female [60, 70) -1.6491
12 Female [70, 80) -1.16567
14 Female [80, 90) 0.54122
16 Female [90, 110) 0.772572
1 Male [0, 20) -5.69155
3 Male [20, 30) 1.49231
5 Male [30, 40) 0.0426196
7 Male [40, 50) 1.9254
9 Male [50, 60) 0.842341
11 Male [60, 70) -0.405931
13 Male [70, 80) 0.277954
15 Male [80, 90) 0.709671
17 Male [90, 110) 0.257332
gender age_group Czechia, Difference between COVID-19 cases proportion and Population proportion [%]
1 Male [0, 20) -5.69155
3 Male [20, 30) 1.49231
5 Male [30, 40) 0
7 Male [40, 50) 1.9254
9 Male [50, 60) 0.842341
11 Male [60, 70) 0
13 Male [70, 80) 0
15 Male [80, 90) 0.709671
17 Male [90, 110) 0
gender age_group Czechia, Difference between COVID-19 cases proportion and Population proportion [%]
0 Female [0, 20) -5.7227
2 Female [20, 30) 2.56553
4 Female [30, 40) 0
6 Female [40, 50) 3.41287
8 Female [50, 60) 2.06146
10 Female [60, 70) -1.6491
12 Female [70, 80) -1.16567
14 Female [80, 90) 0.54122
16 Female [90, 110) 0.772572
In [99]:
attribute_name = 'Canada, Difference between COVID-19 cases proportion and Population proportion [%]'
df_proportions_canada[attribute_name] = (df_proportions_canada['Sample proportion [%]'] - 
                                         df_proportions_canada['True population proportion [%]'])
(df_proportions_canada[['gender', 'age_group', attribute_name]]
 .sort_values(by=['gender', 'age_group'])
 .style.bar(subset=[attribute_name], align='mid', color=['#5fba7d', '#d65f5f'])
 .set_properties(subset=[attribute_name], **{'width': '490px'}))
Out[99]:
gender age_group Canada, Difference between COVID-19 cases proportion and Population proportion [%]
0 Females 0-19 -9.34189
2 Females 20-29 0.21733
4 Females 30-39 1.83671
6 Females 40-49 -0.224539
8 Females 50-59 4.53303
10 Females 60-69 2.42423
12 Females 70-79 0.582821
14 Females 80-89 -0.14327
1 Males 0-19 -9.82978
3 Males 20-29 -1.31637
5 Males 30-39 0.425392
7 Males 40-49 2.03028
9 Males 50-59 2.65976
11 Males 60-69 3.51604
13 Males 70-79 0.879465
15 Males 80-89 0.781381
16 both-genders 90-99 0.998123

back to Content

Conservative Detection of age/gender groups with higher risk of contracting COVID-19

Based on comparison of COVID-19 95% confidence interval lower level vs. population proportion.

This indicates groups definitely in danger based on data.

In [100]:
display_tabular_diffs_cz(col_name='Population proportion CI LB [%]', 
                         col_name_2='Population proportion CI UB [%]', 
                         attribute_name='Czechia, Difference between COVID-19 confidence interval bounds and Population proportion [%]')
gender age_group Czechia, Difference between COVID-19 confidence interval bounds and Population proportion [%]
0 Female [0, 20) -4.72901
2 Female [20, 30) 1.41286
4 Female [30, 40) 0
6 Female [40, 50) 2.06785
8 Female [50, 60) 0.900227
10 Female [60, 70) -0.545604
12 Female [70, 80) -0.193479
14 Female [80, 90) 0
16 Female [90, 110) 0.371647
1 Male [0, 20) -4.64341
3 Male [20, 30) 0.399279
5 Male [30, 40) 0
7 Male [40, 50) 0.644439
9 Male [50, 60) 0
11 Male [60, 70) 0
13 Male [70, 80) 0
15 Male [80, 90) 0.180126
17 Male [90, 110) 0.0568497
gender age_group Czechia, Difference between COVID-19 confidence interval bounds and Population proportion [%]
1 Male [0, 20) -4.64341
3 Male [20, 30) 0.399279
5 Male [30, 40) 0
7 Male [40, 50) 0.644439
9 Male [50, 60) 0
11 Male [60, 70) 0
13 Male [70, 80) 0
15 Male [80, 90) 0.180126
17 Male [90, 110) 0
gender age_group Czechia, Difference between COVID-19 confidence interval bounds and Population proportion [%]
0 Female [0, 20) -4.72901
2 Female [20, 30) 1.41286
4 Female [30, 40) 0
6 Female [40, 50) 2.06785
8 Female [50, 60) 0.900227
10 Female [60, 70) -0.545604
12 Female [70, 80) -0.193479
14 Female [80, 90) 0
16 Female [90, 110) 0.371647

Females

In [101]:
data_input = compare_proportions_visually(df_proportions_czechia, 'Female',
                                          covid_col='Binomial population proportion CI LB [%]',
                                          return_pivoted_table=True)
                   
plot_increase_in_covid_data(data_input, 
                            title='Czechia, Females: conservative COVID-19 proportion increase (Binomial CI)')
In [102]:
data_input = compare_proportions_visually(df_proportions_czechia, 'Female',
                                             covid_col='Population proportion CI LB [%]',
                                             return_pivoted_table=True)
                   
plot_increase_in_covid_data(data_input, 
                            title='Czechia, Females: conservative COVID-19 proportion increase')
In [103]:
data_input = compare_proportions_visually(df_proportions_canada, 'Females', '',
                                             covid_col='Binomial population proportion CI LB [%]',
                                             return_pivoted_table=True)
                   
plot_increase_in_covid_data(data_input, 
                            title='Canada, Females: conservative COVID-19 proportion increase (Binomial CI)')
In [104]:
data_input = compare_proportions_visually(df_proportions_canada, 'Females',
                                             covid_col='Population proportion CI LB [%]',
                                             return_pivoted_table=True)
                   
plot_increase_in_covid_data(data_input, 
                            title='Canada, Females: conservative COVID-19 proportion increase')
In [105]:
data_input = compare_proportions_visually(df_proportions_czechia, 'Male', '',
                                             covid_col='Binomial population proportion CI LB [%]',
                                             return_pivoted_table=True)
                   
plot_increase_in_covid_data(data_input, 
                            title='Czechia, Males: conservative COVID-19 proportion increase (Binomial CI)')
In [106]:
data_input = compare_proportions_visually(df_proportions_czechia, 'Male', '',
                                             covid_col='Population proportion CI LB [%]',
                                             return_pivoted_table=True)
                   
plot_increase_in_covid_data(data_input, 
                            title='Czechia, Males: conservative COVID-19 proportion increase')
In [107]:
data_input = compare_proportions_visually(df_proportions_canada, 'Males', '',
                                             covid_col='Binomial population proportion CI LB [%]',
                                             return_pivoted_table=True)
                   
plot_increase_in_covid_data(data_input, 
                            title='Canada, Males: conservative COVID-19 proportion increase (Binomial CI)')
In [108]:
data_input = compare_proportions_visually(df_proportions_canada, 'Males', '',
                                             covid_col='Population proportion CI LB [%]',
                                             return_pivoted_table=True)
                   
plot_increase_in_covid_data(data_input, 
                            title='Canada, Males: conservative COVID-19 proportion increase')

We can see that multinomial confidence interval estimation, which deals with uncertainty in all categories simultaneously, provide more conservative CI estimation.

back to Content

Conclusions

Using available data from The Czech Republice, we've demonstrated how to perform in-depth drill down, answering various questions with visualizations.

In [109]:
%%HTML
<img src=https://www.elklan.co.uk/images/blogs/thousand_words.jpg width="140" align="left">

Some insights:

  • for both Czech females and males, age groups between 20–29, 40–49, and 50–59 have significantly higher risk of being exposed to COVID-19 than at random.
    • What's the likely reason? -> these are active groups which have to go to work even during the official state of emergency
    • Women and men between 30–39 doesn't have higher risk. Why? -> It was a puzzling observation at first. Then, I checked that average age of Czech mothers is 30.1 years at birth (source in Czech). It means that among 30–39 year-old people there are many parents of young children. Kindergartens, as well as schools are closed, and parents often must stay at home to take care of children.
In [110]:
%%HTML
<img src=https://pics.me.me/that-moment-when-having-children-starts-to-pay-off-23399548.png width="140" align="left">

A takeaway message for everyone: the age group with the majority of parents, which have to stay at home with children and therefore social-distant themselves more systematically, shows how crucial social-distancing is for COVID-19 prevention. Everyone can play a part!

  • efficient state policies for protecting older population groups seem to have effect, thankfully. However, the policies seem to be more efficient for females. So, for men in their 60's and 70's exposure to COVID-19 is not significantly less than at random, while for women of the same age the exposure is significantly less. It might be worth investigating this fact further and see whether or not it might be due to better discipline among women/staying at home with grandchildren more often/other reason.
  • thankfully, proportion of imported cases is going down justifying the governmental traveling restrictions.
  • it seems that we can be reasonably calm about the youngest of our close ones. I still believe that even people under 19 shouldn't let their guard down, as it is known that people who don't have symptoms can still infect others.
  • the risk groups differ in Canada, there they are more shifted towards older people. The difference might be given by different policies of the Czech and Canadian governments. It might suggest the importance of the governmental policies (like the Czech policy allowing only older population groups to visit groceries, post offices, etc. during the morning hours).

Limitations discussion

  • As it was noted previously, the data about positive COVID-19 cases are likely to be biased towards people who not only contracted COVID-19, but at the same time had moderate, severe, or critical symptoms. In accordance with the agreement in the Data section, throughout the post by "contracting COVID-19/be exposed to COVID-19/ect." we meant "contracting COVID-19 and having moderate, severe, or critical symptoms".
  • Next, the COVID-19 sample in the data is likely to be biased towards people who wanted to be tested themselves. E.g. there might be COVID-19-positive people with moderate symptoms who preferred not to visit health facilities during epidemic and were undocumented, while there might be COVID-19-positive people with mild symptoms who preferred to visit a testing center.
  • Moreover, in the beginning of epidemic testing capacities could have been insufficient and it likely introduced additional biases, e.g. during early weeks there was bias towards cases with critical symptoms. 
  • Next, we've analyzed data about confirmed COVID-19 positive cases. We didn't have info about person-level mortality, or consequences of the virus, which would be very important to analyze.
  • Furthermore, we didn't have information about preexisting health conditions of patients, which also are crucially important to analyze. Overall, I'd view this exploratory data analysis as relatively preliminary and would suggest to repeat and extend it once better person-level data are available.

Take care and stay healthy, everyone!